
REMARKS 

The present invention is a method for determining the voicing of a speech
signal segment and a device for determining the voicing of a speech signal segment.
The method in accordance with the invention comprises dividing a speech signal
segment into sub-segments as performed, for example, by segmentation function 106
of Fig. 1 and as illustrated in Fig. 2, determining a value relating to the voicing of
respective speech signal sub-segments as determined, for example, by value
determination function 110 of Fig. 1, comparing the values with a predetermined
threshold and making a decision on the voicing of the speech segment based on the
number of values on one side of the threshold as performed, for example, by voice
decision function 112 of Fig. 1. The decision operates with emphasis on at least one
last sub-segment of the segment. See paragraphs [0010] and [0036] of the
Substitute Specification and the Abstract. The original specification teaches, as set
forth in paragraph [0010] of the Substitute Specification, "[t]his is done by
emphasizing the voicing decisions of the last sub-segments in a frame to detect the
transients from unvoiced to voiced speech... the frame is classified as voiced also if all
of a predetermined number of the last sub-segments have a normalized
autocorrelation value exceeding the threshold value".

Claim 6 stands objected to as set forth in Section 2 of the Office Action. 
Claim 6 has been amended as suggested by the Examiner. 

Claims 1-3 and 8-10 stand rejected under 35 U.S.C. §102 as being 
anticipated by United States Patent 5,734,789 (Swaminathan et al). Specifically, the
summation of sub-segments, which have been compared with items 15035 and
17050 as noted by the Examiner, but do not disclose the claimed emphasis. There 
is no basis in the record why a person of ordinary skill in the art would be led to 
modify the teachings of Swaminathan to arrive at the subject matter of independent 
claims 1 and 8 including the emphasis on at least one last sub-segment. 

Claim 2 further limits claim 1 in reciting "wherein said step of making a
decision is based on whether the value relating to the voicing of the last
sub-segment is on the one side of the threshold" and claim 9 further limits claim 8 in
reciting "wherein said means for making a decision comprises means for determining
if a value of the last sub-segment is on the one side of the threshold". The
Examiner's interpretation of the use of the last subframe value being weighed in
Swaminathan is correct. However, it is submitted that the language of claims 2 and
9 requires the last sub-segment to be the decision-making quantity, which is not
taught by Swaminathan since the last sub-segment is merely weighed in the decision
process together with all other sub-segments.

Claims 3 and 10 respectively limit claims 1 and 8 in reciting that making a decision
is based on whether the values relating to the voicing of the last sub-segments are
on the one side of the threshold, as set forth above with respect to claims 2 and 9. It
is submitted that the Examiner's construction of Swaminathan is that the use of the
last sub-segments meets the limitation of the claim. It is submitted that claims 3 and
10 cannot properly be construed in this manner in view of the limitations of claims 1
and 8 regarding emphasis.

Claims 4-7 and 11-14 stand rejected under 35 U.S.C. §103 as being 
unpatentable over Swaminathan in view of Hess. Hess is cited as teaching
"techniques for voicing determination where adjacent frames are checked a decision
is used making a median smoother". Hess does not cure the deficiencies noted
above with regard to Swaminathan, which does not teach emphasis on at least one
last sub-segment of the segment as recited in independent claims 1 and 8.
Accordingly, it is submitted that claims 4-7 and 11-14 are patentable for the
reasons set forth above. Furthermore, a person of ordinary skill in the art would
not make the suggested modification of Swaminathan with Hess, because the record
provides no reason why such a person would be motivated to do so except by
impermissible hindsight. The Examiner appears to treat Hess being in the same
field of endeavor as sufficient motivation to modify Swaminathan. However, being
in the same field of endeavor does not demonstrate motivation to combine.

The specification has been amended to improve its form for reexamination. 

In view of the foregoing amendments and remarks, it is submitted that the 
application is in condition for allowance. Accordingly, early allowance thereof is 
respectfully requested. 

To the extent necessary, Applicants petition for an extension of time under 
37 C.F.R. §1.136. Please charge any shortage in fees due in connection with the
filing of this paper, including extension of time fees, to Deposit Account No. 01-2135
(367.39387X00) and please credit any excess fees to such Deposit Account. 

Respectfully submitted, 

ANTONELLI, TERRY, STOUT & KRAUS, LLP 



Donald E. Stout
Registration No. 26,422
(703) 312-6600

Field of the Invention



[0001] The present invention relates to speech processing, and more
particularly to a voicing determination of the speech signal having a particular, but 
not exclusive, application to the field of mobile telephones. 



Description of the Prior Art

[0002] In known speech codecs the most common phonetic classification is a
voicing decision, which classifies a speech frame as voiced or unvoiced.
Generally speaking, voiced segments are typically associated with high local
energy and exhibit a distinct periodicity corresponding to the fundamental
frequency, or equivalently pitch, of the speech signal, whereas unvoiced
segments resemble noise. However, a speech signal also contains segments
which can be classified as a mixture of voiced and unvoiced speech where both
components are present simultaneously. This category includes voiced fricatives
and breathy and creaky voices. The appropriate classification of mixed segments
as either voiced or unvoiced depends on the properties of the speech codec.

[0003] In a typical known analysis-by-synthesis (A-b-S) based speech codec,
the periodicity of speech is modelled with a pitch predictor filter, also referred to as
a long-term prediction (LTP) filter. It characterizes the harmonic
structure of the spectrum based on the similarity of adjacent pitch periods in a 
speech signal. The most common method used for pitch extraction is the 
autocorrelation analysis, which indicates the similarity between the present and 
delayed speech segments. In this approach the lag value corresponding to the 
major peak of the autocorrelation function is interpreted as the pitch period. It is 
typical that for voiced speech segments with a clear pitch period the voicing 
determination is closely related to pitch extraction. 

SUMMARY OF THE INVENTION 

[0004] According to a first aspect of the present invention there is provided a
method for determining the voicing of a speech signal segment, comprising the
steps of: dividing a speech signal segment into sub-segments, determining a
value relating to the voicing of respective speech signal sub-segments, comparing
said values with a predetermined threshold, and making a decision on the voicing
of the speech segment based on the number of the values on one side of the
threshold.

[0005] According to a second aspect of the present invention there is provided
a device for determining the voicing of a speech signal segment, comprising
means (106) for dividing a speech signal segment into sub-segments, means
(110) for determining a value relating to the voicing of respective speech signal
sub-segments, means (112) for comparing said values with a predetermined
threshold and means (112) for making a decision on the voicing of the speech
segment based on the number of the values on one side of the threshold.

[0006] The invention provides a method for voicing determination to be used
particularly, but not exclusively, in a narrow-band speech coding system. The
invention addresses the problems of prior art by determining the
voicing of the speech segment based on the periodicity of its sub-segments. The 
embodiments of the present invention give an improvement in the operation in a 
situation where the properties of the speech signal vary rapidly such that the 
single parameter set computed over a long window does not provide a reliable 
basis for voicing determination. 

[0007] A preferred embodiment of the voicing determination of the present
invention divides a segment of speech signal further into sub-segments. Typically
the speech signal segment comprises one speech frame. Furthermore, it may
optionally include a lookahead, which is a certain portion of the speech
signal from the next speech frame. A normalized autocorrelation is
computed for each sub-segment. The normalized autocorrelation
values of the sub-segments are forwarded to classification logic, which
compares the values of the sub-segments to the predefined threshold value. In this
embodiment, if a certain percentage of normalized autocorrelation
values exceeds a threshold, the segment is classified as voiced.

[0008] In one embodiment of the present invention, a normalized
autocorrelation is computed for each sub-segment using a window whose length
is proportional to the estimated pitch period. This ensures that a suitable number
of pitch periods is included in the window.

[0009] In addition to the above, a critical design problem in voicing
determination algorithms is the correct classification of transient frames. This is
especially true in transients from unvoiced to voiced speech, as the energy of the
speech signal is usually growing. If no separate algorithm is designed for
classifying the transient frames, the voicing determination algorithm is always a
compromise between the misclassification rate and the sensitivity to detecting
transient frames appropriately.

[0010] To improve the performance of the voicing determination algorithm
during transient frames with practically no increase in the misclassification rate,
one embodiment of the present invention provides rules for classifying the
speech frame as voiced. This is done by emphasizing the voicing
decisions of the last sub-segments in a frame to detect the transients from
unvoiced to voiced speech. That is, in addition to having a certain number of
sub-segments having a normalized autocorrelation value exceeding a
threshold value, the frame is classified as voiced also if all of a predetermined
number of the last sub-segments have a normalized autocorrelation
value exceeding the same threshold value. Detection of unvoiced to voiced
transients is thus further improved by emphasizing the last
sub-segments in the classification logic.

[0011] The frame may be classified as voiced if only the last sub-segment has
a normalized autocorrelation value exceeding the threshold value.

[0012] Alternatively, the frame may be classified as voiced if a portion of the
sub-segments out of the whole speech frame have a normalized
autocorrelation value exceeding the threshold. The portion may, for example, be
substantially a half, or substantially a third, of the sub-segments of the speech
frame.

[0013] The voiced/unvoiced decision can be used for two purposes. One
option is to allocate bits within the speech codec differently for voiced and
unvoiced frames. In general, voiced speech segments are perceptually more
important than unvoiced segments and thus it is especially important that a
speech frame is correctly classified as voiced. In the case of an A-b-S type of codec,
this can be done, for example, by re-allocating bits from the adaptive codebook
(for example from LTP-gain and LTP-lag parameters) to the excitation signal
when the speech frame is classified as unvoiced to improve the coding of the
excitation signal. On the other hand, the adaptive codebook in a speech codec can
then even be switched off during the unvoiced speech frame, which will lead to
a reduced total bit rate. Because of this on/off switching of LTP-parameters it is
especially important that a speech frame is correctly classified as voiced. It has
been noticed that, if a voiced speech frame is incorrectly classified as unvoiced
and the LTP parameters are switched off, this leads to a decreased sound quality
at the receiving end. Accordingly, the present invention provides a method and
device for making a reliable voiced/unvoiced decision, especially so
that voiced speech frames are not incorrectly decided as unvoiced.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] Exemplary embodiments of the invention are hereinafter described with
reference to the accompanying drawings, in which:

[0015] Fig. 1 shows a block diagram of an apparatus of the present
invention;

[0016] Fig. 2 shows a speech signal framing of the present invention;

[0017] Fig. 3 shows a flow diagram in accordance with the present
invention; and

[0018] Fig. 4 shows a block diagram of a radiotelephone utilizing the
invention.

DETAILED DESCRIPTION OF THE DRAWINGS 
[0019] Fig. 1 shows a device 1 for voicing determination according to the
first embodiment of the present invention. The device comprises a
microphone 101 for receiving an acoustical signal 102, typically a voice signal,
generated by a user, and converting it into an analog electrical signal at line 103.
An A/D converter 104 receives the analog electrical signal at line 103 and
produces a digital electrical signal y(t) of the user's voice at line 105. A
segmentation block 106 then divides the speech signal into predefined sub-segments
at line 107. A frame of 20 ms (160 samples) can, for example, be divided into 4
sub-segments of 5 ms. After segmentation a pitch extraction block 108 extracts the
optimum open-loop pitch period for each speech sub-segment. The optimum
open-loop pitch is estimated by minimizing the sum-squared error
between the speech segment and its delayed and gain-scaled version as
follows:

$$J(t,\tau,g(t)) = \sum_{i=0}^{N-1} \left[ y(t+i) - g(t)\, y(t+i-\tau) \right]^2 \tag{1}$$

where y(t) is the first speech sample belonging to the window of length N, τ is the
integer pitch period and g(t) is the gain.

[0020] The optimum value of g(t) is found by setting the partial derivative of the
cost function (1) with respect to the gain equal to zero. This yields

$$g(t) = \frac{R(t,\tau)}{R(t-\tau)} \tag{2}$$

where

$$R(t,\tau) = \sum_{i=0}^{N-1} y(t+i)\, y(t+i-\tau) \tag{3}$$

is the autocorrelation of y(t) with delay τ and

$$R(t) = R(t,0) = \sum_{i=0}^{N-1} y^2(t+i) \tag{4}$$
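
The step from the cost function (1) to the optimum gain can be written out explicitly. The following is a sketch of the standard least-squares argument and is not text from the original filing. It assumes that the sums run over i = 0, ..., N-1 and that R(t-τ) is nonzero, with R(t,τ) denoting the sum of y(t+i) y(t+i-τ) over the window and R(t) = R(t,0).

```latex
\begin{align*}
\frac{\partial J}{\partial g}
  &= -2 \sum_{i=0}^{N-1} y(t+i-\tau)\,\bigl[\, y(t+i) - g(t)\, y(t+i-\tau) \,\bigr] \\
  &= -2 \bigl[\, R(t,\tau) - g(t)\, R(t-\tau) \,\bigr] = 0
  \quad\Longrightarrow\quad
  g(t) = \frac{R(t,\tau)}{R(t-\tau)} \\[4pt]
J\big|_{g = R(t,\tau)/R(t-\tau)}
  &= R(t) - 2\, g(t)\, R(t,\tau) + g(t)^2\, R(t-\tau)
   = R(t) - \frac{R^2(t,\tau)}{R(t-\tau)}
\end{align*}
```

Since R(t) does not depend on the delay τ, minimizing J over τ amounts to maximizing the subtracted term.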



[0021] By substituting the optimum gain into equation (1), the pitch period is
estimated by maximizing the latter term of

$$J(t,\tau) = R(t) - \frac{R^2(t,\tau)}{R(t-\tau)} \tag{5}$$

with respect to delay τ. The pitch extraction block 108 is also arranged to send
the above determined open-loop pitch estimate τ at line 113 to the
segmentation block 106 and to a value determination block 110. An example of
the operation of the segmentation is shown in Fig. 2, which is described
later.
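
The search over candidate delays described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of block 108. The function names, the list-based signal representation and the search range `tau_min`..`tau_max` are assumptions introduced here, and the sample index `t` is assumed to leave at least `tau_max` samples of history before it.

```python
import math


def autocorr(y, t, tau, n):
    """R(t, tau): sum over i = 0..n-1 of y[t+i] * y[t+i-tau]."""
    return sum(y[t + i] * y[t + i - tau] for i in range(n))


def open_loop_pitch(y, t, n, tau_min, tau_max):
    """Return the delay maximizing R(t, tau)^2 / R(t - tau), i.e. the
    latter term of equation (5). Hypothetical helper for illustration."""
    best_tau, best_score = None, float("-inf")
    for tau in range(tau_min, tau_max + 1):
        past_energy = autocorr(y, t - tau, 0, n)  # R(t - tau)
        if past_energy <= 0.0:
            continue  # silent history: the term is undefined, skip this delay
        r = autocorr(y, t, tau, n)
        score = r * r / past_energy  # note: squaring discards the sign of r
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau


# Synthetic test signal with a 40-sample period (fundamental plus one harmonic).
signal = [math.sin(2 * math.pi * k / 40) + 0.5 * math.sin(4 * math.pi * k / 40)
          for k in range(400)]
estimated_period = open_loop_pitch(signal, t=200, n=80, tau_min=20, tau_max=60)
```

The exhaustive loop is chosen for readability; a real codec would typically restrict or track the search range.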

[0022] The value determination block 110 also receives the speech signal y(t)
from the segmentation block 106 at line 107. The value determination block 110 is



[0023] To eliminate the effects of the negative values of the autocorrelation
function when maximizing the function, a square root of the latter term
of equation (5) is taken. The term to be maximized is thus:

$$C_1(t,\tau) = \frac{R(t,\tau)}{\sqrt{R(t-\tau)}} \tag{6}$$

[0024] During voiced segments, the gain g(t) tends to be near unity and thus it
is often used for voicing determination. However, during unvoiced and transient
regions, the gain g(t) fluctuates, also achieving values near unity. A more robust
voicing determination is achieved by observing the values of equation (6). To cope
with the power variations of the signal, R(t,τ) is normalized to have a
maximum value of unity, resulting in:

$$C_2(t,\tau) = \frac{R(t,\tau)}{\sqrt{R(t)}\,\sqrt{R(t-\tau)}} \tag{7}$$

[0025] According to one aspect of the invention, the window length in (7) is set
to the found pitch period τ plus some offset M to overcome the problems related to
a fixed-length window. The periodicity measure used is thus

$$C_2(t,\tau) = \frac{R_w(t,\tau)}{\sqrt{R_w(t)}\,\sqrt{R_w(t-\tau)}} \tag{8}$$

where

$$R_w(t,\tau) = \sum_{i=0}^{\tau+M-1} y(t+i)\, y(t+i-\tau) \tag{9}$$

and

$$R_w(t) = R_w(t,0) = \sum_{i=0}^{\tau+M-1} y^2(t+i) \tag{10}$$
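
The periodicity measure of equations (8) to (10) can be illustrated with a short Python sketch. The function names and the list-based signal are assumptions made for illustration, as is the choice to return 0 for an all-zero window; the default `m=10` follows the example value of M given in the text.

```python
import math


def windowed_autocorr(y, t, tau, window):
    """Sum over i = 0..window-1 of y[t+i] * y[t+i-tau], as in equation (9)."""
    return sum(y[t + i] * y[t + i - tau] for i in range(window))


def periodicity(y, t, tau, m=10):
    """Normalized autocorrelation over a window of tau + m samples,
    following the form of equation (8)."""
    window = tau + m
    cross = windowed_autocorr(y, t, tau, window)
    energy_now = windowed_autocorr(y, t, 0, window)         # energy at t
    energy_past = windowed_autocorr(y, t - tau, 0, window)  # energy at t - tau
    denom = math.sqrt(energy_now) * math.sqrt(energy_past)
    if denom == 0.0:
        return 0.0  # assumption: treat an all-zero window as non-periodic
    return cross / denom


tone = [math.sin(2 * math.pi * k / 40) for k in range(400)]
c_at_period = periodicity(tone, t=200, tau=40)       # close to +1
c_at_half_period = periodicity(tone, t=200, tau=20)  # close to -1
```

Because of the normalization, the result stays within [-1, 1] regardless of signal level, which is what allows a fixed threshold to be applied to it.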

[0026] The parameter M can be set, e.g., to 10 samples. A voicing
decision block 112 is arranged to receive the above determined periodicity measure
C2(t, τ) at line 111 from the value determination block 110 and parameters K, Ktr,
Ctr to make the voicing decision. The decision logic of the voiced/unvoiced decision
is further described with reference to Fig. 3 below.

[0027] It should be emphasized that the pitch period used in (8)
can also be estimated in other ways than described in equations (1) - (6) above. A
common modification is to use pitch tracking in order to avoid pitch multiples, as
described in Finnish patent application FI 971976. Another optional function for
the open-loop pitch extraction is that the effect of the formant frequencies is
removed from the speech signal before pitch extraction. This can be done, for
example, by a weighting filter.

[0028] Modified signals, for example a residual signal, weighted residual
signal or weighted speech signal, can also be used for voicing determination
instead of the original speech signal. The residual signal is obtained by
filtering the original speech signal by a linear prediction analysis filter.

[0029] It may also be advantageous to estimate the pitch period from the
residual signal of the linear prediction filter instead of the speech signal, because
the residual signal is often more clearly periodic.
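
One way to obtain such a residual signal is sketched below in Python. The text does not specify how the linear prediction coefficients are derived. This sketch assumes the common autocorrelation method with the Levinson-Durbin recursion, and the function names and the default order are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the whole sequence x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1*z^-1 + ... + ap*z^-p from autocorrelations r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) input: keep remaining taps at 0
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def lp_residual(x, order=10):
    """Filter x with the analysis filter A(z); samples before x[0] count as 0."""
    a = levinson_durbin(autocorrelation(x, order), order)
    return [sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
            for n in range(len(x))]


# A decaying exponential is almost perfectly predicted by a first-order filter.
decay = [0.9 ** n for n in range(100)]
coeffs = levinson_durbin(autocorrelation(decay, 1), 1)
residual = lp_residual(decay, order=1)
```

The residual keeps the excitation-like part of the signal while the spectral envelope is flattened, which is why its periodicity tends to be easier to detect.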

[0030] The residual signal can be further low-pass filtered and down-sampled
before the above procedure. Down-sampling reduces the complexity of
correlation computation. In one further example, the speech signal is first filtered
by a weighting filter before the calculation of autocorrelation is applied as
described above.

[0031] Fig. 2 shows an example of dividing a speech frame into four
sub-segments whose starting positions are t1, t2, t3 and t4. The window lengths N1,
N2, N3 and N4 are proportional to the pitch period found as described above. The
lookahead is also utilized in the segmentation. In this example, the number
of sub-segments is fixed. Alternatively, the number of sub-segments can be variable
based on the pitch period. This can be done, for example, by selecting the
sub-segments by t2 = t1 + τ + L, t3 = t2 + τ + L, etc. until all available data is
utilized. In this example L is constant and can be set, e.g., to -10, resulting in
overlapping sub-segments.
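
The variable-count segmentation can be sketched as follows in Python. The function and parameter names are hypothetical. The sketch further assumes a single pitch period τ for the whole frame, a window length of τ + M for every sub-segment, and that a sub-segment is kept only if its window fits within the available data (frame plus lookahead).

```python
def subsegment_starts(t1, tau, data_end, m=10, offset_l=-10):
    """Return start positions t1, t1 + (tau + L), t1 + 2*(tau + L), ...

    data_end is the index one past the last available sample. Each
    sub-segment is assumed to span tau + m samples from its start.
    """
    window = tau + m
    step = tau + offset_l
    if step <= 0:
        raise ValueError("tau + offset_l must be positive")
    starts = []
    t = t1
    while t + window <= data_end:
        starts.append(t)
        t += step
    return starts


# 200 available samples, pitch period 40: windows of 50 samples every 30 samples.
starts = subsegment_starts(t1=0, tau=40, data_end=200)
```

With a negative L the step is shorter than the window, so consecutive sub-segments overlap, as the text notes.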

[0032] Fig. 3 shows a flow diagram of the method according to one
embodiment of the present invention. The procedure is started by step 301 where
the open-loop pitch period τ is extracted as exemplified above in equations (1) -
(6). At step 302, C2(t, τ) is calculated for each sub-segment of the speech as
described in equation (8). Next, at step 303, the number of sub-segments n
where C2(t, τ) is above a certain first threshold value Ctr is calculated. The
comparator 304 determines whether the number of sub-segments n, determined
at step 303, exceeds a certain second threshold value K. If the second threshold
value K is exceeded, the speech frame is classified as voiced. Otherwise the
procedure continues to step 305. In this embodiment, at step 305 the comparator
determines if a certain number Ktr of the last sub-segments have a value C2(t, τ)
exceeding the threshold Ctr. If so, the speech frame is
classified as a voiced frame. Otherwise the speech frame is classified as an unvoiced
frame.

[0033] The exact parameter values Ctr, Ktr and K presented above are not
limited to certain values but are dependent on the system specified and can be
selected empirically using a large speech database. For example, if the speech
segment is divided into 9 sub-segments, suitable values can be, for example,
Ctr = 0.6, Ktr = 4 and K = 6. An appropriate value of K and Ktr is proportional to the
number of sub-segments.
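
The decision logic of steps 303 to 305 can be sketched in Python as below, using the example values Ctr = 0.6, K = 6 and Ktr = 4 for 9 sub-segments. The function name and argument layout are assumptions for illustration. "Exceeds" is read here as a strict comparison, which is one possible reading of the text.

```python
def classify_frame(c2_values, c_tr=0.6, k=6, k_tr=4):
    """Return True if the frame is classified as voiced.

    c2_values holds one periodicity value per sub-segment, in time order.
    """
    if k_tr < 1:
        raise ValueError("k_tr must be at least 1")
    # Step 303: count the sub-segments whose value is above the threshold.
    n = sum(1 for c in c2_values if c > c_tr)
    # Step 304: enough sub-segments above the threshold -> voiced.
    if n > k:
        return True
    # Step 305: emphasis on the end of the frame. Voiced also if all of the
    # last k_tr sub-segments are above the same threshold.
    last = c2_values[-k_tr:]
    return len(last) == k_tr and all(c > c_tr for c in last)


steady_voiced = classify_frame([0.9] * 7 + [0.1] * 2)  # 7 of 9 above threshold
onset = classify_frame([0.1] * 5 + [0.8] * 4)          # only the last 4 above
offset = classify_frame([0.8] * 4 + [0.1] * 5)         # only the first 4 above
```

The `onset` and `offset` frames have the same count of sub-segments above the threshold, and only the position of those sub-segments within the frame changes the outcome. This is the emphasis on the last sub-segments that targets unvoiced to voiced transients.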

[0034] Alternatively, according to the present invention, the frame is classified
as voiced if only the last sub-segment (i.e. Ktr = 1) has a normalized
autocorrelation value exceeding the threshold value. According to still one
modification the frame is classified as voiced if substantially half of the
sub-segments out of the whole speech frame (e.g. 4 or 5 sub-segments out of 9) have
a normalized autocorrelation value exceeding the threshold.

[0035] Fig. 4 is a block diagram of a radiotelephone including
the relevant parts of the present invention. The radiotelephone comprises a
microphone 61, keypad 62, display 63, speaker 64 and antenna 71 with switch for
duplex operation. Further included is a control unit 65, implemented for example in
an ASIC circuit, for controlling the operation of the radiotelephone. Fig. 4
also shows the transmission and reception blocks 67, 68 including speech
encoder and decoder blocks 69, 70. The device for voicing determination 1 is
preferably included within the speech encoder 69. Alternatively the voicing
determination can be implemented separately, not within the speech encoder 69.
The speech encoder/decoder blocks 69, 70 and the voicing determination 1 can
be implemented by a DSP circuit including known elements such
as internal/external memories and registers, for implementing the present
invention. The speech encoder/decoder can be based on any standard/technology
and the present invention thus forms one part for the operation of such codec. The
radiotelephone itself can operate in any existing or future telecommunication
standard based on digital technology.

[0036] To improve the performance of the voicing determination algorithm, the
last sub-segments are emphasized. Specifically, the performance of the voicing
determination algorithm in unvoiced to voiced transients is improved by classifying
the frame as voiced also if all of a predetermined number of the last sub-segments
have a normalized autocorrelation value exceeding the same threshold value.

[0037] In view of the foregoing description, it will be evident to a person skilled
in the art that various modifications may be made within the scope of the present
invention.

ABSTRACT



This invention presents a voicing determination algorithm for classification of a
speech signal segment as voiced or unvoiced. The algorithm is based on a
normalized autocorrelation where the length of the window is
proportional to the pitch period. The speech segment to be classified is further
divided into a number of sub-segments, and the normalized
autocorrelation is calculated for each sub-segment. If a certain number of the
normalized autocorrelation values is above a predetermined threshold,
the speech segment is classified as voiced. To improve the performance of the
voicing determination algorithm in unvoiced to voiced transients, the
normalized autocorrelations of the last sub-segments are
emphasized. The performance of the voicing decision algorithm can be enhanced
by also utilizing the possible lookahead information.